…ion through the VMDataset
johnwalz97
left a comment
Couple of comments to reduce memory overhead
cachafla
left a comment
Awesome. Let me do some testing with this notebook 🙂
Big memory savings:
vm_ds = vm.init_dataset(
Small dataset: Before: After:
Bigger dataset: Before: After:
PR Summary
This pull request introduces several enhancements to the …
These changes improve the flexibility and robustness of the dataset handling within the library, particularly for users dealing with large datasets or requiring specific data type handling.
Internal Notes for Reviewers
This PR addresses an issue where some pandas `DataFrame` dtypes (e.g., categorical types) are lost during `VMDataset` initialization. This change bypasses the `VMDataset` initialization and modifies only the `DataFrameDataset` class, which now stores the original `DataFrame` directly instead of converting it to NumPy arrays and back. This ensures that all pandas-specific dtype information and metadata are preserved.
BEFORE:

AFTER:

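The dtype loss described in these notes can be reproduced with plain pandas, independent of the library. The snippet below is only a minimal sketch: the column names and values are made up for illustration, and it does not use or reproduce the actual `DataFrameDataset` code. It compares a round trip through a NumPy array with simply keeping the `DataFrame`.

```python
import pandas as pd

# Illustrative data only; the column names are assumptions, not from the PR.
df = pd.DataFrame(
    {
        "age": pd.Series([25, 40, 31], dtype="int64"),
        "plan": pd.Categorical(["basic", "pro", "basic"]),
    }
)

# Round trip: DataFrame -> NumPy array -> DataFrame.
# A mixed-type frame becomes a single object array, so the
# categorical dtype of "plan" is not carried over.
round_tripped = pd.DataFrame(df.to_numpy(), columns=df.columns)

# Keeping the DataFrame itself (here as a copy) preserves every dtype.
kept = df.copy()

print(df.dtypes)
print(round_tripped.dtypes)
print(kept.dtypes)
```

Comparing the three printed dtype listings shows which approach keeps the categorical column intact.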
Testing
Successfully ran the `quickstart_customer_churn_full_suite.ipynb` and `application_scorecard_executive.ipynb` notebooks.
External Release Notes
- Pandas dtypes are now preserved when initializing `VMDataset` objects with `vm.init_dataset()`.
- Added a `copy_data` option to `init_dataset()` to skip creating a copy of the input dataframe. This option helps when working with large datasets in memory-restricted environments. By default, `copy_data` is True. Example usage:
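The original usage snippet is not available here, so the following is a hedged sketch of what a `copy_data` flag implies rather than the library's implementation. `DatasetSketch` is a hypothetical stand-in class, and the `balance` column is invented for illustration. In real use, the flag would presumably be passed as `copy_data=False` to `vm.init_dataset()`.

```python
import pandas as pd


class DatasetSketch:
    """Hypothetical stand-in for a DataFrame-backed dataset (not ValidMind code)."""

    def __init__(self, df: pd.DataFrame, copy_data: bool = True):
        # copy_data=True: hold an independent copy (costs extra memory).
        # copy_data=False: hold a reference to the caller's DataFrame.
        self.df = df.copy() if copy_data else df


raw = pd.DataFrame({"balance": [100.0, 250.5, 75.25]})

copied = DatasetSketch(raw)                   # default: a separate copy
shared = DatasetSketch(raw, copy_data=False)  # no copy: same object as raw

print(copied.df is raw)
print(shared.df is raw)
```

The trade-off is that skipping the copy saves memory, while later changes to the original dataframe may become visible through the dataset object.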